Representation-Driven Reinforcement Learning
We present a representation-driven framework for reinforcement learning. By
representing policies as estimates of their expected values, we leverage
techniques from contextual bandits to guide exploration and exploitation.
In particular, embedding a policy network into a linear feature space allows us
to reframe the exploration-exploitation problem as a
representation-exploitation problem, where good policy representations enable
optimal exploration. We demonstrate the effectiveness of this framework through
its application to evolutionary and policy gradient-based approaches, leading
to significantly improved performance compared to traditional methods. Our
framework provides a new perspective on reinforcement learning, highlighting
the importance of policy representation in determining optimal
exploration-exploitation strategies.
Comment: Accepted to ICML 2023
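As a minimal sketch of the bandit-over-representations idea as we read it (not the paper's actual method), the snippet below treats each candidate policy's embedding as a contextual-bandit arm and applies a LinUCB-style rule to trade off estimated value against uncertainty when choosing which policy to deploy next. The embedding, candidate generation, and environment here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                                       # policy embedding dimension
A = np.eye(d)                               # regularized Gram matrix
b = np.zeros(d)

def linucb_select(features, A, b, alpha=1.0):
    """Score each candidate policy embedding by estimated value plus
    an exploration bonus, as in LinUCB; return the best index."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                       # ridge estimate of value weights
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
    return int(np.argmax(features @ theta + alpha * bonus))

for step in range(100):
    # Hypothetical: embeddings of 16 candidate policies (e.g., perturbed
    # networks in an evolutionary loop), one row per candidate.
    candidates = rng.normal(size=(16, d))
    i = linucb_select(candidates, A, b)
    phi = candidates[i]
    # Stand-in environment: noisy return, linear in the embedding.
    ret = phi.mean() + 0.1 * rng.normal()
    A += np.outer(phi, phi)                 # online ridge regression update
    b += ret * phi
```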
A Convex Relaxation Approach to Bayesian Regret Minimization in Offline Bandits
Algorithms for offline bandits must optimize decisions in uncertain
environments using only offline data. A compelling and increasingly popular
objective in offline bandits is to learn a policy which achieves low Bayesian
regret with high confidence. An appealing approach to this problem, inspired by
recent offline reinforcement learning results, is to maximize a form of lower
confidence bound (LCB). This paper proposes a new approach that directly
minimizes upper bounds on Bayesian regret using efficient conic optimization
solvers. Our bounds build on connections among Bayesian regret, Value-at-Risk
(VaR), and chance-constrained optimization. Compared to prior work, our
algorithm attains superior theoretical offline regret bounds and better results
in numerical simulations. Finally, we provide some evidence that popular
LCB-style algorithms may be unsuitable for minimizing Bayesian regret in
offline bandits.
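To illustrate the VaR/chance-constraint connection in general terms (this is not the paper's algorithm), the sketch below assumes a Gaussian posterior over arm rewards, under which a VaR-style lower bound on a randomized policy's return is second-order cone representable and can be handed to an off-the-shelf conic solver via cvxpy:

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

rng = np.random.default_rng(1)
K = 5
mu_hat = rng.normal(size=K)                 # posterior mean of arm rewards
L = rng.normal(size=(K, K)) / np.sqrt(K)
Sigma = L @ L.T + 1e-3 * np.eye(K)          # posterior covariance (PSD)
delta = 0.1
z = norm.ppf(1 - delta)                     # Gaussian quantile for the VaR level

pi = cp.Variable(K, nonneg=True)            # randomized policy over arms
chol = np.linalg.cholesky(Sigma)
# VaR-style lower bound on return under the Gaussian posterior:
# E[return] - z * std(return); the norm term makes this an SOCP.
lower_bound = mu_hat @ pi - z * cp.norm(chol.T @ pi, 2)
prob = cp.Problem(cp.Maximize(lower_bound), [cp.sum(pi) == 1])
prob.solve()
print("policy:", np.round(pi.value, 3))
print("VaR lower bound:", round(prob.value, 3))
```

Maximizing this bound is equivalent to maximizing a threshold t under the chance constraint P(return >= t) >= 1 - delta, which is where the conic structure comes from.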
Modeling Recommender Ecosystems: Research Challenges at the Intersection of Mechanism Design, Reinforcement Learning and Generative Models
Modern recommender systems lie at the heart of complex ecosystems that couple
the behavior of users, content providers, advertisers, and other actors.
Despite this, the focus of the majority of recommender research -- and most
practical recommenders of any import -- is on the local, myopic optimization of
the recommendations made to individual users. This comes at a significant cost
to the long-term utility that recommenders could generate for their users. We
argue that explicitly modeling the incentives and behaviors of all actors in
the system -- and the interactions among them induced by the recommender's
policy -- is strictly necessary if one is to maximize the value the system
brings to these actors and improve overall ecosystem "health". Doing so
requires: optimization over long horizons using techniques such as
reinforcement learning; making inevitable tradeoffs in the utility that can be
generated for different actors using the methods of social choice; reducing
information asymmetry, while accounting for incentives and strategic behavior,
using the tools of mechanism design; better modeling of both user and
item-provider behaviors by incorporating notions from behavioral economics and
psychology; and exploiting recent advances in generative and foundation models
to make these mechanisms interpretable and actionable. We propose a conceptual
framework that encompasses these elements, and articulate a number of research
challenges that emerge at the intersection of these different disciplines.
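The social-choice tradeoff mentioned above can be made concrete with a toy example of our own (not from the paper): two hypothetical recommender policies and three actor groups, ranked differently by a utilitarian aggregator and by Nash social welfare.

```python
import numpy as np

# Hypothetical per-group utilities (users, providers, advertisers)
# generated by two candidate recommender policies.
utilities = {
    "policy_A": np.array([0.9, 0.2, 0.8]),   # higher total, unbalanced
    "policy_B": np.array([0.6, 0.6, 0.6]),   # lower total, balanced
}

for name, u in utilities.items():
    utilitarian = u.sum()                     # total welfare, blind to equity
    nash = u.prod()                           # Nash welfare, rewards balance
    print(f"{name}: utilitarian={utilitarian:.2f}, nash={nash:.3f}")

# policy_A wins under the utilitarian aggregator, policy_B under Nash
# welfare: the choice of aggregator encodes the cross-actor tradeoff.
```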
Ranking with Popularity Bias: User Welfare under Self-Amplification Dynamics
While popularity bias is recognized to play a role in recommender (and other
ranking-based) systems, detailed analyses of its impact on user welfare have
largely been lacking. We propose a general mechanism by which item popularity,
item quality, and position bias jointly shape user choice, and we show how this
interplay can degrade the collective user utility of various recommender policies.
Formulating the problem as a non-stationary contextual bandit, we highlight the
importance of exploration, not to eliminate popularity bias, but to mitigate
its negative effects. First, naive popularity-biased recommenders are shown to
induce linear regret by conflating item quality and popularity. More generally,
we show that, even in linear settings, identifiability of item quality may not
be possible due to the confounding effects of popularity bias. However, under
sufficient variability assumptions, we develop an efficient UCB-style algorithm
and prove corresponding regret guarantees. We complement our analysis with
several simulation studies.
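To make the self-amplification dynamic concrete, here is a toy simulation (our own construction, not the paper's model or algorithm): an item's click probability is its quality amplified by its current popularity share, so a recommender that ranks by raw clicks locks onto early winners, while a UCB-style rule keeps estimating quality and recovers higher user welfare.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T, gamma = 10, 20000, 0.5
quality = rng.uniform(0.2, 0.8, size=K)     # latent item quality

def run(policy):
    clicks = np.zeros(K)                    # per-item click counts
    shows = np.ones(K)                      # per-item impression counts
    welfare = 0.0
    for t in range(1, T + 1):
        if policy == "popularity":
            i = int(np.argmax(clicks))      # naive: rank by raw popularity
        else:                               # UCB on observed click rate
            i = int(np.argmax(clicks / shows + np.sqrt(2 * np.log(t) / shows)))
        # Hypothetical choice model: popularity share multiplies quality,
        # so observed clicks confound the two (self-amplification).
        pop_share = (clicks[i] + 1) / (clicks.sum() + K)
        p = min(1.0, quality[i] * (pop_share * K) ** gamma)
        clicks[i] += rng.random() < p
        shows[i] += 1
        welfare += quality[i]               # user welfare from true quality
    return welfare / T

print("popularity-ranked welfare:", run("popularity"))
print("UCB welfare:              ", run("ucb"))
```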